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Abstract 

In proteomics 2-dimensional gel electrophoresis (2-DE) is a separation technique for proteins. 
The resulting protein spots can be identified by either using picking robots and subsequent mass 
spectrometry or by visual cross inspection of a new gel image with an already analyzed master 
gel. Difficulties especially arise from inherent noise and irregular geometric distortions in 2-DE 
images. Aiming at the automated analysis of large series of 2-DE images, or at the even more 
difficult interlaboratory gel comparisons, the bottleneck is to solve the two most basic algorithmic 
problems with high quality: Identifying protein spots and computing a matching between two 
images. For the development of the analysis software CAROL at Freie Universitat Berlin we 
have reconsidered these two problems and obtained new solutions which rely on methods from 
computational geometry. Their novelties are: 1. Spot detection is also possible for complex 
regions formed by several ''merged" (usually saturated) spots; 2. User-defined landmarks are 
not necessary for the matching. Furthermore, images for comparison are allowed to represent 
different parts of the entire protein pattern, which only partially "overlap". The implementation is 
done in a client server architecture to allow queries via the Internet. We also discuss and point at 
related theoretical questions in computational geometry. 

1 Introduction 

The proteomic research deals with the systematic analysis of complete profiles of the proteins ex- 
pressed in a given cell, tissue or biological system at a given time. In this field, 2D-gel electrophoresis 
(2-DE) is a well-established and widely used technique to separate proteins in a sample. A 2-DE gel 
is the product of two sequentially performed separations in acrylamide gel media: isoelectric focus- 
ing as first dimension and a separation by molecular weight as second dimension. The result of that 
process is a 2D pattern of spots each representing a protein. There is a variety of staining methods to 
display protein spots: Coomassie blue and silver-staining and / or fluorescent and radioactive label- 
ing. Mass spectrometry and the visual (usually computer aided) comparison (matching) with already 
analyzed gel images of such a sample are used to identify proteins. The visual analysis of such 2- 
DE image series intends to identify those proteins that change their expression (size, intensity) and 
reflect/cause certain biochemical and biomedical conditions of an organism, see [26]. However, this 
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requires high throughput analysis tools and the major challenge is to obtain both robust and reliable 
algorithmic solutions that work automatically, or at least need only little user interaction. As we 
will see, this goal in conjunction with efficiency requirements makes it necessary to start with simple 
(sometimes oversimplified) modelling assumptions. 

A second strand of research was initiated in [22]. Here the challenge consists of creating analysis 
software that is able to perform interlaboratory gel image comparisons and to deal with images from 
various databases via the Internet. 

The intensive research on these questions emphasizes the strong demand from the practitioners 
for better algorithmic solutions. 

The aim of this paper is to demonstrate how ideas from computational geometry help to meet 
these challenges in the context of the two most basic problems in gel analysis: 

• Spot detection: Given a scanned 2-DE gel image identify spot regions. 

• Gel matching: Given two images by their spot lists, perform a matching procedure to identify 
those spot pairs that correspond to each other, so that the matching reflects both geometric and 
spot intensity resemblance of the images. 

In this paper we survey algorithmic studies and implementation work that was done while developing 
the 2-DE analysis software system CAROL ([9]). It contains both new results (Section 2, Subsection 
3.4) and partly published ones (Section 3, [17]). Interestingly, this application problem has also lead 
to new theoretical questions in geometry. 

In Section 2 we address the spot detection question in the following context. Originally the 
CAROL project was designed to answer only matching queries for gel images. It turned out that 
available spot detection algorithms did not adequately support the underlying philosophy that is based 
on the similarity of geometric point patterns. Thus the spot detection approach presented here aims 
at determining the point pattern of a gel as precise as possible rather than computing exact shapes 
and volume data of spots. We assume that each spot has approximately the shape of an axis-parallel 
ellipse. This is a widely accepted, but surely simplified, modeling assumption, see for example [5] or 
[14]. However, spots that are very close to each other may partially overlap and "merge". This can 
lead to rather complicated local pixel regions, as depicted in Figure 1, compare [20], 

In [23] a spot detection algorithm is described that relies on a watershed transformation applied 
to the gradient image. The most difficult part is the interpretation of twin spots, streaks (left side 
in Fig.l), and so-called complex regions (right side in Fig. 1) as unions of ellipses. This especially 
applies to the case of oversaturated regions in silver stained gels. 




Figure 1 : Twin spots, streaks, and a complex region 



In fact, in [23] the latter case of interpreting complex regions was left open and in the implementation 
the user had to edit these complex regions by hand. Here we present an algorithmic solution to this 



question based on a Linear Programming (LP) formulation. To the best of our knowledge it is the first 
algorithmic attempt to deal with spot detection of complex oversaturated regions in a single image. 
An alternative approach has been studied in [6], where the information of multiple already aligned 
images is used for a subsequent spot detection. 

There is an inherent uncertainty in the images due to the electrophoresis process itself which is 
highly susceptible to faults and geometric distortions. The reader may argue just by looking at the 
images, that there is no hope to come up with a perfect spot detection algorithm. This is certainly true. 
Yet we can cope with this ambiguity, at least to some extent by computing a ranked list of proposals 
how a complex region could be covered instead of computing only the "best" covering, which could 
be erroneous. Then, in the matching algorithm, we have the possibility to accept the matching of two 
ambiguous regions if there is a pair of proposed coverings that match. 

The matching algorithm itself is described in Section 3. Assume that a source and a target image 
are given by their spot lists. We first transform the lists into geometric point patterns, where a spot is 
represented by the ellipse center coordinates and additionally we code the real size/intensity of the spot 
into a single integer value. The lists can be of significantly different size and, additionally, we allow 
the images to show only partially overlapping sections of the gels. Thus the task is to find a maximal- 
cardinality one-to-one matching of the spot lists, so that this mapping respects both the geometry of 
the images and the relative spot intensities i.e., relatively intensive source spots should be mapped to 
relatively intensive target spots. The geometric matching criterion is based on similarity of imaginary 
edges which link spots. Two edges are similar if their length ratio and the difference of their angles 
formed with the x-axis are within preset tolerance bounds. Our algorithmic solution, which we call 
global-via-local matching, works in two stages. Firstly, we compute matchings for several small local 
patterns, which are chosen in a grid-like fashion. For this local matching we use a modified alignment 
method known from point pattern matching ([2]) combined with geometric hashing. A significant 
speed-up is gained by reducing the set of all source / target edges to those belonging to the history 
of the incremental Delaunay triangulation of the spot lists in the order of their decreasing intensities. 
Having sets of matching candidates for each of the local patterns, we select a subset that is consistent 
with the overall grid topology. The matching spot pairs computed this way later serve as "landmarks" 
in the second stage. Landmarks were also used in previous algorithmic solutions, but they had to 
be set manually. The novelty of our solution is to generate them automatically. The subsequent 
extension from landmarks to a maximal global matching is rather standard by local neighborhood 
comparisons. Since this step uses mainly geometric information, it is capable to match spots with 
different intensities, that correspond to significantly regulated proteins. Moreover in the extension 
step it is possible to exploit the proposal lists for ambigious spot regions, as computed in the spot 
detection. This way, we present a first step towards the simulation of how an expert would compare 
images by visual inspection, namely, by intertwining the spot detection with the matching process. 

Figure 2 shows two 2-DE gel images originated from a transfection experiment of cultured EaHy 
cells with 800 (left) and 1000 protein spots. Their original size is about 23cm by 29cm. For the 
purpose of illustration in Figure 3 and 4 more details are shown of the small rectangular window 
regions marked in Figure 2. (For simplicity ellipses were replaced by circles.) 

In both Sections 2 and 3 we also refer to related theoretical issues and open problems in compu- 
tational geometry that could further improve our solution. In Section 4 we outline implementation 
issues and describe experimental results. 



3 



Figure 2: Two gel images of cultured EaHy cells from a transfection experiment 




Figure 3: Detailed local images of Fig.2, a selected pattern on the left side and a partial matching 



2 Spot Detection in Complex Regions as an Ellipse Covering Problem 
2.1 Modeling of the Core Problem 

Spot detection is to a wide extent an image processing problem. In fact, if the spots are well sepa- 
rated, classical methods like filtering and watershed transformation on the gradient image provide a 
satisfying algorithmic solution, see [23]. As mentioned above, the core combinatorial problem is the 
interpretation of complex pixel regions as unions of axis parallel ellipses. 

The following formalization is a compromise stemming from discussions with practitioners who 
solve these covering instances by peer review. First, we remark that complex regions are typically 
saturated (black), and therefore gray level information does not help for the detection. We assume that 
a connected pixel pattern R is given, which is fat in the sense that there are neither short horizontal 
nor vertical cuts consisting of only two pixels. In the application R is usually a simply connected 
subpattern of a 100 x 100-pixel square. 
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Figure 4: Spot sets detected with black spots indicating the local matching 



To make use of the powerful machinery of geometric approximation algorithms we identify pixels 
with their center points. Then the region R can be represented by two sets of points C and F, where 
C is a sample of points to be covered (inside R) and F is a set of forbidden points (outside R). One 
can obtain these sets walking along the boundary of R and choosing points inside and outside within 
a small distance. This approach somehow mimics the general practice of experts who are looking 
for ellipses approximating long parts of the boundary of R. The advantage that both sets are of small 
cardinality has to be paid for by the fact that the computed cover could have some hole in the interior of 
the region. To avoid this one also can choose C as the set of all grid points in R (from an appropriately 
dense grid). Then, of course, the cardinality m of C can be quadratic in n, the size of F. 

Now we can formulate the approximate covering problem from the application point of view. For 
numbers 0 < S, e < 1 the problem is to find a smallest-cardinality set S of axis-parallel ellipses 
fulfilling the following conditions. 

1. (Fitting) Each ellipse E £ £ respects F, i.e., it does not intersect F. 

2. (Intersection) The boundaries of every pair of ellipses E, E' in S intersects in at most 2 points, 
and area(£ n E') < S ■ min{area(E),area(J5')}- 

3. (Covering) \Jezs e covers at least a (1 - e)-portion of pixels in C. 

The aim to find minimal-cardinality coverings is in accordance with Occam's razor principle, since 
the smallest cardinality set is, in absence of a master gel, the simplest hypothesis to explain how 
a complex region could have been evolved. At first sight one would require to cover all points in 
R, however in order to cope with noise the covering condition allows to leave a small portion of 
points uncovered. The fitting condition ensures that the selected ellipses do not cover much more 
than the region R. From the way the gel production process spreads a protein in the two dimensions, 
practitioners usually conclude that two spots do not form a cross. We incorporated this notion in the 
intersection condition. 

Following the standard methodology of set cover approximation, our ellipse covering problem 
splits into two parts: 

1. Find a basic set E of ellipses such that each ellipse E <E E fulfills the fitting property, and 

RQUe^E. 
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Figure 5: Ellipse covering computed by brute force method (left) and by LP approach (right) 



2. Select a minimum cardinality subset £ C E which respects the intersection and covering prop- 
erties. 

Remark that (the boundary of) an axis-parallel ellipse can be represented by the solutions of the 
equation 

( ^ + t^-l = 0 (1) 
a 2 b 2 K } 

with parameters a,b,c,d G 2R, where (c,d) is the ellipse center, and a and b are the lengths of 
the halfaxes. Thus a straightforward approach is to discretize the parameter space of all possible 
ellipses, and define E to be the set of all inclusion maximal ellipses fulfilling the fitting property. 
One way to avoid misinterpretation of streaks is to exclude ellipses of an "extreme" shape, i.e., to 
require a/6 6 [1/p.p] for a suitable constant p > 1. This also reduces the complexity of E. The 
selection off C E can then be easily done in a greedy fashion. We implemented this approach, 
however both runtime and the quality of the solution did not meet our expectations, see e.g. Figure 5. 
In Subsection 2.3 we study from a theoretical point of view how to improve on both, making use of 
advanced geometric data structures and techniques. In Subsection 2.2 we present an alternative LP- 
based approach, which avoids the explicit construction of a large basic set E, and instead combines 
construction and selection of ellipses. Experimental results show that both quality and runtime of the 
LP-approach are superior to the simple brute force solution. 

2.2 Using an LP Approach to Generate Ellipses 

In the implementation we assumed that the region R is simply connected and rectilinear. However, 
it is straightforward how to extend the algorithms to arbitrary polygonal regions. Aiming at better 
runtimes, C and F are samples of those points inside and outside of R which are close (one pixel 
distance) to the boundary of R. The basic rule of the sampling heuristic we apply is to choose a 
denser sample for C (resp. F) in convex (resp. concave) boundary parts with high curvature. 

Observe that each element in F forms an outer constraint for each ellipse in the cover because of 
the fitting condition. On the other hand, we want at least a few points (say, at least three) from C to be 
included in an ellipse of the covering. Therefore, we start from a randomly chosen triplet of mutually 
visible points from C and ask whether there is an axis-parallel ellipse containing these points so that 
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Figure 6: Three covering proposals for a complex region generated by the LP-approach 



it does not violate an outer constraint. Once having the information that for a given triplet there is a 
feasible solution one can efficiently extend the covered sample subset by an LP-based approach. 

We start from the observation that the parameters of an ellipse E can be transformed into variables 
of an LP in such a way that each of the three points which have to be covered by E adds a linear 
constraint to the outer constraints. The constraints are derived by plugging a point p = (p x ,Py) into 
the ellipse formula ^ J n l' :) ~ + {Pv ^ d) — 1 which has to be zero (negative, positive) for points on (inside, 
outside) E. Since this is not a linear constraint we start with one of the form 

PxX\ - pyX2 + PyXi -Xi=p 2 x (resp. <, >) (2) 
By elementary calculation, setting t — ^ + ^ _ we derive as ellipse parameters 

<-{l °=! ■ < 3 > 

Given the existence of a feasible solution (not an explicit ellipse yet!) one wants to extend the point 
set that can be covered. Clearly, it makes sense to add and check first the neighbors of the already 
covered sample points. This is also done in a random way. However, it is clear that one can easily get 
stuck when there is a very restrictive partial solution, in the sense that we have still a feasible system 
but the actual solution set of ellipses is very small and does not allow to add a new sample point. 
To implement the necessary backtracking step, we have adapted a simplified version of the so-called 
Metropolis methodology, see for example [19], where it has been used for the generation of large 
cliques in graphs, a situation similar to ours. 

This random Markov-chain like process is organized in rounds. With probability q > 1/2 we have 
a round "extend", that is we choose and try to add a random neighbor; with probability 1-gwe have 
a round "backtrack" in which we delete a randomly chosen point from the already covered point set. 
After a fixed number of rounds we actually compute an explicit candidate ellipse to be included next 
into the partial covering. Remark that the shape of such an ellipse is almost completely determined by 
the covered points and by the outer constraints. To arrive at a unique solution, the LP-solver chooses 
that one with halfaxes ratio closest to 1. This is repeated a constant number of times and we select the 
candidate that covers the maximum number of points; ties are broken randomly. 

There remains to explain the incorporation of the intersection condition into the LP-approach. We 
delete already covered points from C. For each selected ellipse E we add new "outer" constraints to 
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F by sampling the interior of E. These new points prevent later chosen ellipses from having large 
intersections with E. Again, this scheme is run until the covering condition is met. In the example 
depicted in Figure 6 the black pixels represent C. The dots outside the region describe the initial outer 
constraints, within the chosen ellipses the additional outer constraints are also indicated. Together 
they form the final set F. Remark that, willing to spend more runtime, the quality of the covering 
can be improved by choosing a denser sample. Observe that, apart from the differences in the three 
coverings depicted in Figure 6, they in fact represent very similar point patterns when considering the 
centers of the ellipses only. 

2.3 Related Theoretical Issues 

We want to discuss the ellipse covering problem from a more theoretical point of view. As in Subsec- 
tion 2. 1 let a connected pixel pattern R be given. We identify a pixel with its center, and assume that 
pixel centers lie on an e-grid. Let F be the set of grid points not in R that have distance e to R. Let 
n = \F\, which yields \R\ — - O(rr). In this subsection we consider the case that the set C of points 
to be covered consists of all pixel centers in R, thus R — C. 

The ellipse covering problem is to find a collection £ = {E} , E^] of minimal cardinality 
of axis-parallel open ellipses such that the union U = \J E{ covers R, and respects F. An ellipse 
respects F if it contains no point of F in its interior, however possibly on its boundary. 

This covering problem is a simplified version of our original covering problem. It shares the fitting 
condition, and a special case (e = 0) of the covering condition. Note that this restriction can easily 
be relaxed by including e into the stopping condition of the algorithm described below. Nevertheless 
it is still a challenge to incorporate an additional requirement like the intersection condition into the 
approximate set covering framework. 

The problem of (approximately) covering a shape with ellipses is strictly related to the problem 
of exact covering of a shape with rectangles, which was shown to be NP-complete [12]. It is also 
related to the problem of covering a shape with strips [1], and to the range covering problem in a 
hypergraph [8], Thus, in the general setting there is not much hope for finding a polynomial time 
algorithm. However it is possible to employ the Bronnimann & Goodrich paradigm [8] in order to 
obtain a polynomial time algorithm that computes a covering with ellipses defined by grid points, 
such that the cardinality of the cover is k op t logfc opt , where k op \ is the cardinality of the optimal-size 
solution. The runtime of a straightforward implementation of the paradigm is 0{n 5 k opt log n). In 
the sequel we present a faster implementation which runs in 0(n 3 ), where O denotes a variant of the 
O-notation which subsumes polylogarithmic factors. 

First observe that we can restrict our search to maximal ellipses. An ellipse E is y-maximal if 
(1) its center is in a grid point of R, it (2) does not intersect F in its interior, (3) among all ellipses 
with the same center and width E has the largest height, ^-maximal ellipses are defined analogously. 
An ellipse is maximal if it is either ^-maximal or y-maximal. Note that each maximal ellipse has a 
point of F on its boundary. A point of F on the boundary of an ellipse E is called a dominator of E. 
Clearly, we can assume that the optimal cover consists of maximal ellipses. 

Let E be the set of all possible maximal ellipses with centers in R, and width or height defined 
by grid points. There are are Our) points p in R, and each p gives raise to 0{u) maximal ellipses 
centered atp, since there are O(n) different possible widths and heights. Thus |E| = 0(n 3 ). The rest 
of this section is dedicated to explain how to find a collection £ = (JBq . . . E^} C E of cardinality 
k — 0(k opi log&opt) whose union respects F and covers R. Note that fc op t = 0(n) since one can 
choose a thin ellipse covering all points in a row or column of R between two points in F. 
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The Bronnimann & Goodrich approach assumes that A; 0 pt is known, thus we assume in the se- 
quel that fcopt is given. It can be determined afterwards by applying unbounded binary search. The 
overview of the Bronnimann & Goodrich algorithm is as follows: It gives a weight w(E) to each 
ellipse E G E, initially all set to be 1. The algorithm consists of rounds. In each round a random 
sample £ of size ck opt log A,- opt of the ellipses of E is picked, where the probability of an ellipse to be 
picked is proportional to its weight, and c is a constant. Next the algorithm checks if £ covers R. If 
this is not the case, we find a point q e R which is not covered (arbitrarily chosen if there are more 
than one), and double the weight of each E G £ q , where £ q C E is the set of ellipses containing q. 
Then the algorithm continues with the next round and stops when a cover is found. Since the Vapnik- 
Chervonenkis dimension of the problem is clearly finite, it follows from [8] that the expected number 
of rounds is 0{k opl log n). 

In a somehow similar fashion to [1], we construct a data structure that enables us to (implicitely) 
modify the weight of all ellipses in £ q . We describe the process for y-maximal ellipses. z-maximal 
ellipses can be handled in an analogous manner. 

For p G R let e w (p) denote the y-maximal ellipse of width w whose center is in p. Let r be a 
row of grid points of R, and let w be a fixed width. Using the property that the boundaries of e w {p\) 
and e w (p2), P\,P2 G r, intersect at most once above r, it is straightforward to show (c.f. [13]) that 
for a given s G F the region {p £ r|.s is a dominator for e w (p)} forms a connected interval on r. 
Similarly it follows that for q G R the set I w , r (q) := {p G r\q G e w (p)} is also an interval on r. Now 
consider a partition of r into maximal connected intervals, such that in each interval the dominator 
that determines e w (p) is the same. It follows that the complexity of this partition is 0(n). Using 
standard divide and conquer we can compute this partition in O(nlogn) time. In the first stage of 
the algorithm we compute and store these partitions for all rows and all widths. This requires 0(n 3 ) 
space and 0(n 3 log n) preprocessing time. For a given q G R and width w we can now compute the 
interval I WJ {q) using binary search along r in O(logrc) time as follows: For a p G r we locate p in 
the partition of r, and compute e w (p) in constant time. If q £ e. w (p) we can deduce if (i) for each 
p' G r to the right ofp, q £ e w {p'), or (ii) for each p' G r to the left ofp, q (£ e w {p'). 

After picking a random sample £ C E of ellipses, we need to see if £ covers R. For this, we 
compute U£ explicitly, (using for example a line sweep technique). Since the size of £ is O(k) the 
complexity of this union is 0(k 2 ), and it can be computed in time 0(k 2 log A;). Using point location 
we can find the first point p G R \ U£ in 0{n 2 log k) time. 

In order to maintain the weight of the ellipses efficiently, we do not construct the weight of each 
ellipse in E explicitly. Instead, for each row r and each possible fixed width w we construct a binary 
balanced search tree T r , w on the grid points of r, sorted by their .x-coordinatc. Each leaf v G T r , w 
is associated with the maximal ellipse of width w whose center is (at the point) v. Each inner node 
H G T r , w maintains a factor ^ /( , and the sum of the weights of all leaves in its subtree (up to common 
factors on the path to the root). The weight of a leaf is the product of all factors on the path to the 
root. For a node p. let 7 M C r be the set of all points corresponding to leaves in the subtree at In 
order to double the weights of all maxima! ellipses of width w whose center lie on T Wtr (q), we double 
the factor lo^ for each node //, for which 1^ C I w<r (q), but If a ther(ii) 15 not contained in I W;r (q). 
Analogous to standard segment trees, we see that we need to update only O(log n) factors and sums 
of weights, and that computing the weight of a specific maximal ellipse is doable in O(logn). Picking 
the random subset is done as follows. We pick a tree at random, according to the sum of weights of 
leaves of the tree which is stored in the root. Next we compute a random path in the tree from the 
root to a leaf, where at an inner node the probability of branching to the left or the right child depends 
on the stored sum of weights of the leaves in the subtrees. The ellipse we pick is the maximal ellipse 
associated with the leaf that ends the path. Thus picking one random ellipse needs O(logn) time. 
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Altogether we need 0(n 3 logn) preprocessing time to construct the trees and the partitions. In 
each round, of which there are 0(k opt log n) many, we pick a random sample in 0(k opt log fc opt log n) 
time, compute the union and find an uncovered point q which needs 0{(k 2 + n 2 ) log A;) time, and 
compute for all w and r the interval I w , r (q), and update the weights of the points in I w , r (q), which 
can be done in 0{n 2 log n) time. This yields a total runtime of 0(n 3 ), where O denotes a variant of 
the 0-notation which subsumes polylogarithmic factors. Thus we obtained the following theorem: 

Theorem 1 Let R, F and E he as above, and let n — \F\. Then in expected time 0(n 3 ) one can find 
a set £ C E of ellipses whose union covers R and avoids F. The cardinality of 8 is 0(k opt log k opt ), 
where k opl is the cardinality of an optimal solution. 

Note that the total runtime is dominated by the 0(n 3 ) term for the preprocessing. We are opti- 
mistic that, using 2-dimensionaI planar subdivisions into regions with the same dominator, and con- 
structing only those parts being accessed in the trees, we can obtain a runtime of 0(n 2 k opt ). This is 
the subject of our ongoing research. 

3 Computing Global via Local Matchings 
3.1 Modeling the Matching Problem 

Assume that a source image S and a target image T are given by their spot lists. Recall that after 
the spot detection a "spot" is simply a vector {x(s),y(s), i(s)) consisting of its nonnegative point 
coordinates (x(s),y(s)) in the Euclidean plane and a positive real number i(s) describing its intensity. 

By this simplification of spots we can treat the geometric matching task as an approximate point 
pattern matching problem, see [2] for a survey on this field. There, the general setting for our bottle- 
neck matching type problem is as follows. For a real number e > 0, a group of admissible transfor- 
mations A (e.g. translations, rigid motions and/or scalings) and a metric d, one seeks a bijection / 
between as large as possible subsets S 1 C S and T" C T, so that there exists a transformation g € A 
for which d(g(s),f(s)) < e for every s e S'. Thus, the transformation g describes the geometric 
resemblance between S' and V . Additionally, we want that the intensity t(s) for s e S' resembles 
the intensity of f(s). However, this is only the general framework for our solution. There are several 
major difficulties stemming from the inherent noise in the electrophoresis process: 

1. 2-DE images, even of the same probe, can differ significantly and the offset vectors for match- 
ing spots are only locally almost equal. 

2. The spot resolution of the images can be different. Moreover, given a correct matching the 
intensity orderings of matched spots in both images are similar but not identical. 

3. It often happens that one has to compute a matching of images that only partially "overlap". 
Previous algorithmic solutions assume that both images show the same frames, or at least there 
are preset landmark pairs that allow to initialize the matching procedure. We neither want to 
choose the matching frames nor landmarks by hand. 

To cope with these problems we choose the following 2-stage approach, which we call the global- 
via-local matching paradigm. See Figure 7 for a flow chart. 

1 . In a first step we compute a rather dense set of landmarks. This is realized via local matchings. 
A local matching is the registration of a small local source pattern (consisting of locally most 
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If there is such t, augment pair (s,t) to M — > Extended Matching M 



Global Matching - second stage : Extension Loop | 

Figure 7: Flow chart of the matching algorithm. 



intensive spots) in the target. For a single pattern the result is typically not unique, yielding dif- 
ferent matching proposals. But for a set of grid-like located local source patterns we can choose 
a consistent common registration (i.e., matching) by comparing the underlying transformations, 
which should be similar for neighboring patterns. 
2. A second step implements the extension of the landmark matching to neighboring spots. This 
is done by a variant of the standard neighborhood graph comparison. Moreover, in this step we 
take advantage of the extra information concerning ambiguous regions that was gathered in the 
spot detection preprocessing. 

Next we formalize the geometric matching criterion for the registration of a local pattern P C S in 
the target T. 

We call two line segments ss' and it' (A, a)-similar if the absolute difference of their angles with 
the x-axis is smaller than a and for their lengths we have: 

1 - A < \~s~s 1 \/\tt l \ < 1 + A (4) 

Two point p atterns P C S and Q C T (A, «)-match if there is a bijection / between the point sets so 
that ss' and f(s)f(s') are (A, «)- similar for all s, s' £ P. In sum, from the geometric point of view 
we want to find (A, a) -matchings / between as large as possible subpatterns P 1 C P and local target 
patterns Q, compare also [21]. 

As described in [17], we cannot model intensity resemblance by directly comparing i(s) and 
i(f(s j), instead we rank, in both source and target, the spot intensities into 10 groups and require that 
the matching respects those ranks within a tolerance of 1. 

3.2 How to Compute Local Matchings 

In this subsection we present an efficient algorithm which computes partial approximate matchings of 
a pattern P in a target point set T. The algorithm has to find maximal subsets P' C P (at least 50 
%), corresponding subsets QCT and transformations t approximating the matching between P' and 
Q, Firstly, we discuss the transformation class of translations, but, we always keep in mind possible 
generalizations to nomothetic transformations (translations plus scalings). In the runtime analysis k 
and n will denote the number of points in P and in T, respectively. 

The algorithm is basically a twofold refinement of the naive alignment approach which computes 
a geometric matching by aligning a pattern edge e £ P with a similar edge e' in T. If one wants 
to find full matchings of P, it is sufficient to fix e and search for all similar target edges. However, 
looking also for partial matchings, one has to check all pattern edges against all edges in T. Each 
edge similarity induces a translation t,. e i which maps the midpoint of e onto the midpoint of e'. Then 
candidate points for a matching can be found by nearest neighbor queries for the k points of the trans- 
formed pattern t e ^(P). Hence, the runtime is bounded by 0(k 3 n 2 log n) which combines the k 2 n 2 
edge comparisons with the 0(k log n) time to compute a matching. However, the algorithm is much 
faster in practice because most of the edges similarity tests are answered negatively. 

Remark 1: It is not hard to construct examples where the nearest neighbor q £ T of a transformed 
pattern point does not satify the requirements of a geometric matching whereas another neighbor q' 
does. Therefore all neighbors within a certain distance have to be checked. 

Remark 2: Looking only for matchings approximated by translations it v.ot.id be sufficient to check 
all transformations defined by point pairs (p,q) £ P x T. However, such an approach is not very 
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suitable in practice because it is not extendable to homothetic transformations and, moreover, the re- 
sulting 0(k 2 n log n) time bound is tight. 

Remark 3: In [18] a randomized version of the alignment method has been studied. One of the 
results is a 0{hr log n) Monte Carlo algorithm computing all partial matchings which match at 
least a constant fraction of P. 

Instead of randomization we make use of the intensity resemblance which should hold for a majority 
of spots. To this end we order our point sets according to decreasing intensity. A triangulation of 
a point set X in the plane is called Delaunay triangulation if for each triangle in the triangulation 
its circumcircle contains only the three triangle points. One can construct such a triangulation in an 
incremental way by inserting points one by one in the given order, compare [15]. The insertion of a 
point can destroy some of the previous edges and add some new edges. The history Hist(X) of the 
incremental triangulation is the set of all Delaunay edges which occur at some time in this process. 
There are two results from Computational Geometry which together make up a fast and very robust 
heuristic. 

Theorem 2 (see [3]) Let X be a point set with an intensity ordering. Assume that a pattern Q C X 
consists of the most intensive points in a rectangle R. Consider a triangle A occuring during the 
incremental Delaunay triangulation ofQ. If the circumcircle of A is contained in R, then each edge 
of A belongs to Hist(X). 

This implies that under ideal assumptions (of nearly identical images) it suffices to consider all 
similarities between edges from Hist(P) and Hist(T). In practice however one cannot start from these 
assumptions: the matching is only approximate (local distortions), sometimes partial (missing spots), 
and moreover the intensity orders can vary (spots of regulated proteins). iNevertheless the Delaunay 
Triangulation approach is still suitable for the following three reasons: 

1. The algorithm below is designed in such a way that it is not necessary to have for all edges of 
Hist(P) similar counterparts in Hist(T). 

2. Augmenting Hist(T') with all flip edges which occur in the incremental triangulation we obtain 
the so called extended history Hist*(T). This increases the chance to find similar edges; see 
[17] for some experimental results. 

3. Even if the intensities of some spots in P differ heavily from thcii corresponding spots in T, 
there is a high chance to find the partial matching of the remaining pattern points. Based on this 
partial matching, geometrically corresponding spots with different intensities can be found in 
the extension stage, see Subsection 3.4. 

The following fact implies an estimate on the expected size of Hist(T). 

Theorem 3 (see [24]) Let X be a random permutation of u points. Then the incremental Delaunay 
triangulation of X can be computed in 0(n\ogn) expected time. The expv Aed number of edges in 
Hist(X) is (){n). 

Therefore the expected size of Hht*(X) is also linear. Summing up the restriction of the align- 
ment approach to the comparison of edges from Hist*(P) and Hisf*(T) implies a matching algorithm 
with expected time complexity ()(lrv, log n). 
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In fact, although this first improvement reduces the computation lime remarkably, there is still 
a lot of wasted work because for any edge similarity e ~ e' the algorithm attempts to construct a 
matching approximated by translation t efi > in 0(k logn) time. However, most of these attempts are 
bound to fail. We call a transformation t e e > good if it approximates a matching of at least half of the 
points in P and bad otherwise. 

The standard approach of how to avoid testing bad transformations is geometric hashing. Orig- 
inally hashing has been worked out for the exact matching problem. The idea is to collect first all 
candidate transformations in a hash table. Note that in contrast to bad transformations the good ones 
occur at least [ k ! 2 ) times. In the end, the good transformations can be selected and, moreover, for 
each of them the corresponding matching has to be computed only once. 

Here wc develop a special variant of geometric hashing. We replace the hash table by a scoring 
scheme which uses a regularly spaced in x in grid over T with m 2 variables counting scores for all 
grid nodes. We discuss later how to choose m. This structure is suitable to compute also approximate 
matchings and it beats hash tables in both time and storage complexity. One has to pay for these 
advantages by accepting a certain heuristic flavor in this method. Let cy> denote the center of the 
bounding box of the pattern P. Again we start with all translations /, ,,• defined by similar edges 
e G Hist*(P) and e' e Hisf(T). We store t e y adding scores to the four nodes of the grid cell which 
contains t ee i(cp). It is straightforward that good transformations yield clusters of points which in 
turn are represented by high scores in the neighboring grid nodes. Altogether we get an expected 
runtime of 0{n log n + kn + m 2 + |7^ ooc j| A; log n), where |7^ 00ti ! denotes the number of good 
transformations, i.e., the number of computed partial matchings. The first and second term estimates 
the triangulation construction and the edge comparisons. Since the last term is of smaller order the 
number of grid nodes, i.e. in 2 , plays the key role in the algorithm's analysis. Finer grids can display 
the cluster positions more precisely than coarser grids (with less nodes) 

There is one more trick that allows to have both rather coarse grids awl n sufficient precision of 
cluster positions. Each translation representative t e>e >(cj>) subdivides its grid cell into four rectangles. 
Instead of unit scores, each of the four grid nodes adds to its current score an amount proportional 
to the area of the opposite rectangle. This way the precise position of a single point < e ,e'( c p) can be 
recomputed from its scores. Let Score(z, j) be the total score accumulated in grid node after 
probing all edges from Hist*(P). All local maxima in the grid that are lm eater than a threshold value 
depending on \P\ are considered to correspond to potential matching locations. 
Eventually, we can approximate the actual center (i c ,j c ) of the vector cluster stemming from a local 
maximum at node by computing a weighted average of the scores at and all scores at 
neighboring grid nodes: 

. E U',-iE^-iScorei.A-,Q-(fc,/ ) 

J E^-tE^-i Score(M) " ( ' 

The generalization from translations to nomothetic transformations is straightforward. (A, a)-similarity 
has to be replaced by o-similariry (the angle between the edges is at nn>M .el for each a-similar 
pah (e, e 1 ) the transformation t K e i is defined by the scaling factor \e'\/U ] and the translation mapping 
the midpoint of the scaled e onto the midpoint of c'. Consequently, !he two -dimensional scoring 
scheme has to be replaced by a three-dimensional structure with the third dimension representing the 
scaling factors. In our implementation we used an exponential scale of the ihird axis where the factor 
between two grid units is A. 

Finally wc have to compute the local matchings corresponding to good transformations t G 7^ 00( j. 
To this end the whole pattern P is transformed: t(P) — {t(p) \p G P). Each transformed point t(p) 
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marks a position in the target image and we are seeking for matching car ,; dates p' 6 T close to 
this position. As mentioned earlier the nearest neighbor to a given position Is not always the correct 
matching partner. We call an incorrect matching of a point p with a point p' which is close to the 
position t(p) a switch. To illustrate this problem suppose that there arc two very close spots p\,P2 
in the pattern, p\ on top of po- Since the correct matching partners p\ . i> 2 £ T are also very close it 
can happen that the lower target point p' 2 is closer to the position t(p\ ) than the top point p[. Here 
the naive nearest neighbor strategy would decide for the switch {p\,p',)- Clearly, one can make the 
correct decision and avoid the switch by comparing the neighborhoods: Both, p\ andp'j have a close 
neighbor below, whereas, p' 2 has not. Consequently, pi should be matched with p[ rather than with 
p' 2 , although p., is closer to the expected position t(p\). 

To implement this idea we introduce small boxes B(p) centered at the r. sitions t(p) and consider 
all p' £ T n B{p) as potential matching candidates for p. The favourite .--lection depends on the 
neighborhood similarity of p and p 1 . Here the neighborhoods are defined by tr • Delaunay triangulation 
histories which had been computed during the preprocessing of the source and target point sets. This 
way there are no additional expenses for the neighborhood computation, h is essential to remark that 
the reliability of neighborhood-based decision depends on the correctnes <. f (he spot detection. If the 
protein spots p\ and p 2 in the above example are so strongly overlapping that they were detected as 
one single spot p 1 , the neighborhood comparison cannot help. At this point the different proposals 
computed in the spot detection come into play. Omitting all spots stemming from ambiguous regions 
in the local matching procedure one can improve the reliability of the ma.ching result. Consequently 
it will be necessary to compute the matchings of spots from ambiguous regions in a postprocessing. 

3.3 Assembling a Global Matching from Local Matchings 

The local point pattern matching described in the previous subsection is the central module of our 
algorithm. We compute the global matching of two images S and 7 n jveral stages. First the 
source image is segmented into equally sized, rectangular subimages. In our implementation we used 
a regular 5 x 5-grid for this subdivision. Then for each of the 25 subimages the pattern formed by 
its 12 most intensive unambiguous spots is selected (25 and 12 are pan meter values which proved 
to be optimal in numerous tests). Applying the local matching to those local patterns we get a list 
of matching proposals for each pattern. Now we apply a simple consistent y test to those proposal 
lists. Two local matchings (of different patterns) are consistent if the underlying transformations are 
approximately the same. Finally a first global matching is established by searching for the largest 
family of pairwise consistent local matchings. This first global matching is then used as a skeleton 
M of landmarks for further extensions. Note that by the chosen paranv. ' .If has at most 300 spot 
pairs. 

Another advantage of our approach is the possibility to compute a "doc 1" matching for images 
which only partially overlap, see Figure 8. In contrast to commercial progr; ins we do not assume a 
global alignment of the images. This feature is useful for reconstructing big gel images from several 
partially overlapping subimages. 

3.4 How to Extend a Global Matching 

The purpose of the matching extension procedure is twofold. Firstly, v • want to match unambiguous 
spots which, according to their lower intensity, were not in the initial u cal patterns. Secondly, we 
have to deal with spots from ambiguous regions. The first problem is sob ed by following a standard 
approach: The pairs in the skeleton M form a set of landmarks which defines a piecewise affine 
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Figure 8: A global matching of two images which have only a small hoezontal : : rip in common 



transformation tu- Again, the Delaunay triangulation proves to be a use 
We construct both the Delaunay triangulation DT source of the previous!;, 
and DT taTgf .i of the corresponding target points T_\[. In order to map a no 
Delaunay triangle A = s 2 , S3) in DT source by point location. If p is 0 
Sm one can choose the Delaunay triangle A closest to p. There is a uniq 
mapping the triangle vertices onto their partners (*i,i2, £3)- Thus, the exp. 
in the target image is <a/(p) := *a(p)- Note that the expected time for 1. , 
0( log I Sm I). The matching candidates and the choice of a matching <\\ 
explained in Subsection 3.2. 

The extension procedure is iterated several times. After the first extciv 
resulting skeleton is fine enough to attack ambiguous regions. To this end 
terminology As before we call a spot unambiguous if it was determined 
tion. Each ambiguous region is represented by a so called superspol whit 
to consider she whole region as one complicated unit. The coordinates 
by the barycenter of the region. Spots contained in an ellipse covering p 
of the superspot. We make one more simplification focussing our attentic 
only on the superspot and on the subspots of the best proposal computed h ; 
this simplification is appropriate for twin and triple spots rather than for ■ 
are two reasons for proceeding this way: The comparability of the cdg.s 
nay triangulation is the basis of the local matching approach. However, ; > 
proposals (describing the same region) would yield lists which do not " : 
geometric situation. Instead, inserting the subspots of the best proposal on : 
get similar histories of the source and the target image. Therefore these s;. 
incremental Delaunay triangulation, although they are neither taken as loe: 
nor as matching candidates in the target. Moreover, considering more than 
a combinatorial explosion in the search for candidates and the decision 
hood of a superspot (which itself is not inserted into the Delaunay triangu 
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the set-theoretic union of all subspot neiborhoods without the subspots tiu Ives. 

Starting with the second extension step all superspots and their (bcs 1 p.oposa) saaspots in both 
the source and the target image are taken into account. Thus, on the source aide there are new spots p 
to be matched. On the target side there are new candidates. 

We illustrate the advantage of our approach considering once more i' a c!croci •■•try twin spot sit- 
uation. In the source (resp. the target) image there are three possible results the . ot detection can 
return: 

1. The overlapping is so strong that the common region is detected as e re un nbiguous spot 5 
(resp. t). 



2. The overlap has been detected, and thus the common region is am', 
consists of two spots si , s 2 (resp. i x , t 2 ) which are subspots of a so 

3. The algorithm has detected two separated unambiguous spots sj , 
Depending on the combination of these events and provided that the outsit' 
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In the traditional approach without ambiguous regions only cases 1 aa. ' a mi o . . ar. Consequently 
we would obtain a matching only, if the spot detection makes the same d . eisic n on I- sides (column 
(\) and (3)). Now in our approach a matching will additionally be discover .! if ;:: least on one side 
the ambiguity (case 2) was recognized. This way the overall matching rest, ja.i o-e used to resolve 
ambiguities in the results of the initial spot detection. In our view this is 1 - aiisio ; attempt to over- 
come the separation between preprocessing and matching, and thus to sin jL.:. hew an expert would 
compare images by visual inspection, namely, by intertwining the spot detection ith the matching 
process. 



4 Implementation Issues and Experimental Result 

The 2-DE analysis software system CAROL ([9]) contains the describee! v. 1 
algorithms. In order to facilitate its usage over the Internet it consists o 
part, the combinatorial and geometrical kernel of the spot detection and 11 
been implemented in C++. It makes essential use of the Standard Tempi; 
the Computational Geometry Algorithms Library (CGAL) [10]. The latto 
geometric data structures and functions and especially an implementation 
triangulation. For the LP approach of the spot detection, which we implci ji. 
we use the commercial LP-solver CPLEX. 

The second part of the CAROL system is the graphical user interface v ■ ic 
in Java. It can be run as an applet started out of an internet browser or a.- a: .. 
nication with the algorithmic procedures is established via internet soc' . . v. 
works as a server which waits for matching or spot detection requests h 11 ; 
the computation, and eventually sends the results back to the client. 
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Using CAROL it is possible to analyze gel images from databases 
feature is strongly supported by the client-server architecture and the po: 
terface out of an internet browser. However, a direct Internet access is re' 
by a firewall. CAROL offers the possibility to open GIF images from ar 
out the spot detection, to perform local or global matching* between l\ 
rameters like tolerance bounds, pattern size, etc. The current versior 
http gelmatching. inffu-berlin. de. 

The local matching algorithm executed on a Sun Ultra Sparc 30i % 
matchings for a pattern of twelve spots in about 0.2s. The first stage of th 
two gel images with 900 and 1000 spots takes about 7s and yields aboa 
After this step the user can optionally correct the proposed landmarks. A 
takes 2s. The computation of the spot detection for a full gel image (12f' n 
takes about half a minute without complex regions; in such large images ■ 
20 complex regions. The brute force greed)' approach for the compl.n - 
a 76 x 80 array, with 59 inner sample points drawn in black) needs 2<i 
covering indicated. The LP-based solution (with initially 76 points defin 
13s to compute the indicated solution. However, we remark that usu 
computed once before storing the image in a database. 

Tests of the CAROL system on several series of gel images as >.\ 
from external test users have confirmed the advantages of our geometri • ■ 
concentrate on a more accurate mathematical modeling of the biocheme 
within the electrophoresis process. 
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